A Method for Compiling High-Level Language Programs to a Reconfigurable Data-Flow Processor

1 Introduction 

This document describes a method for compiling a subset of a high-level programming language (HLL)
like C or FORTRAN, extended by port access functions, to a reconfigurable data-flow processor (RDFP)
as described in Section 3. The program is transformed to a configuration of the RDFP.

This method can be used as part of an extended compiler for a hybrid architecture consisting of a standard
host processor and a reconfigurable data-flow coprocessor. The extended compiler handles a full HLL
like standard ANSI C. It maps suitable program parts like inner loops to the coprocessor and the rest
of the program to the host processor. It is also possible to map separate program parts to separate
configurations. However, these extensions are not subject of this document.

2 Compilation Flow






This section briefly describes the phases of the compilation method.



2.1 Frontend



The compiler uses a standard frontend which translates the input program (e.g. a C program) into an
internal format consisting of an abstract syntax tree (AST) and symbol tables. The frontend also performs
well-known compiler optimizations such as constant propagation, dead code elimination, common
subexpression elimination etc. For details, refer to any compiler construction textbook like [1]. The SUIF
compiler [2] is an example of a compiler providing such a frontend.

2.2 Control/Dataflow Graph Generation

Next, the program is mapped to a control/dataflow graph (CDFG) consisting of connected RDFP func- 
tions. This phase is the main subject of this document and presented in Section 4. 



2.3 Configuration Code Generation

Finally, the last phase directly translates the CDFG to configuration code used to program the RDFP. For
PACT XPP™ Cores, the configuration code is generated as an NML (Native Mapping Language) file.

3 Configurable Objects and Functionality of a RDFP 

This section describes the configurable objects and functionality of a RDFP. A possible implementation
of the RDFP architecture is a PACT XPP™ Core. Here we only describe the minimum requirements for
a RDFP for this compilation method to work. The only data types considered are multi-bit words called
data and single-bit control signals called events. Data and events are always processed as packets, cf.
Section 3.2. Event packets are called 1-events or 0-events, depending on their bit-value.
3.1 Configurable Objects and Functions



An RDFP consists of an array of configurable objects and a communication network. Each object can
be configured to perform certain functions (listed below). It performs the same function repeatedly until
the configuration is changed. The array need not be completely uniform, i.e. not all objects need to be
able to perform all functions. E.g., a RAM function can be implemented by a specialized RAM object
which cannot perform any other functions. It is also possible to combine several objects to a "macro" to
realize certain functions. Several RAM objects can, e.g., be combined to realize a RAM function with
larger storage.
Figure 1: Functions of an RDFP

The following functions for processing data and event packets can be configured into an RDFP. See Fig. 1
for a graphical representation.

o ALU[opcode]: ALUs perform common arithmetical and logical operations on data. ALU functions
("opcodes") must be available for all operations used in the HLL.¹ ALU functions have two
data inputs A and B, and one data output X. Comparators have an event output U instead of the
data output. They produce a 1-event if the comparison is true, and a 0-event otherwise.

¹Otherwise programs containing operations which do not have ALU opcodes in the RDFP must be excluded from the
supported HLL subset or substituted by "macros" of existing functions.
o CNT: A counter function which has data inputs LB, UB and INC (lower bound, upper bound
and increment) and data output X (counter value). A packet at event input START starts the
counter, and event input NEXT causes the generation of the next output value (and output events)
or causes the counter to terminate if UB is reached. If NEXT is not connected, the counter counts
continuously. The output events U, V, and W have the following functionality: For a counter
counting N times, N-1 0-events and one 1-event are generated at output U. At output V, N 0-events
are generated, and at output W, N 0-events and one 1-event are created. The 1-event at W is only
created after the counter has terminated, i.e. a NEXT event packet was received after the last data
packet was output.

o RAM[size]: The RAM function stores a fixed number of data words ("size"). It has a data input
RD and a data output OUT for reading at address RD. Event output ERD signals completion of
the read access. For a write access, data inputs WR and IN (address and value) and data output
OUT are used. Event output EWR signals completion of the write access. ERD and EWR always
generate 0-events. Note that external RAM can be handled as RAM functions exactly like internal
RAM.

o GATE: A GATE synchronizes a data packet at input A and an event packet at input E. When
both inputs have arrived, they are both consumed. The data packet is copied to output X and the
event packet to output U.

o MUX: A MUX function has 2 data inputs A and B, an event input SEL, and a data output X. If
SEL receives a 0-event, input A is copied to output X and input B discarded. For a 1-event, B is
copied and A discarded.

o MERGE: A MERGE function has 2 data inputs A and B, an event input SEL, and a data output X.
If SEL receives a 0-event, input A is copied to output X, but input B is not discarded. The packet
is left at the input B instead. For a 1-event, B is copied and A left at the input.

o DEMUX: A DEMUX function has one data input A, an event input SEL, and two data outputs X
and Y. If SEL receives a 0-event, input A is copied to output X, and no packet is created at output
Y. For a 1-event, A is copied to Y, and no packet is created at output X.

o MDATA: A MDATA function multiplicates data packets. It has a data input A, an event input
SEL, and a data output X. If SEL receives a 1-event, a data packet at A is consumed and copied
to output X. For all subsequent 0-events at SEL, a copy of the input data packet is produced at the
output without consuming new packets at A. Only if another 1-event arrives at SEL, the next data
packet at A is consumed and copied.²

o INPORT[name]: Receives data packets from outside the RDFP through input port "name" and
copies them to data output X. If a packet was received, a 0-event is produced at event output U,
too. (Note that this function can only be configured at special objects connected to external busses.)

o OUTPORT[name]: Sends data packets received at data input A to the outside of the RDFP through
output port "name". If a packet was sent, a 0-event is produced at event output U, too. (Note that
this function can only be configured at special objects connected to external busses.)
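
The packet semantics of some of the data functions above can be made concrete with a small software model. The following is only an illustrative sketch, not part of the described hardware: queues stand in for function inputs, `None` stands for "no packet", and the helper names are hypothetical.

```python
from collections import deque

def mux(a, b, sel):
    """MUX: consumes a packet from A, B and SEL; forwards A on a 0-event, B on a 1-event."""
    s, x, y = sel.popleft(), a.popleft(), b.popleft()
    return y if s else x

def merge(a, b, sel):
    """MERGE: consumes only the selected data input; the other packet stays at its input."""
    return b.popleft() if sel.popleft() else a.popleft()

def demux(a, sel):
    """DEMUX: returns (X, Y); only the selected output carries the packet."""
    s, v = sel.popleft(), a.popleft()
    return (None, v) if s else (v, None)

def cnt_events(n):
    """Event sequences at CNT outputs U, V, W for a counter counting n times (n >= 1)."""
    u = [0] * (n - 1) + [1]   # N-1 0-events, then one 1-event
    v = [0] * n               # N 0-events
    w = [0] * n + [1]         # N 0-events, then one 1-event after termination
    return u, v, w

# MERGE leaves the unselected packet in place, MUX discards it.
a, b = deque([10]), deque([20])
merged = merge(a, b, deque([0]))
a2, b2 = deque([10]), deque([20])
muxed = mux(a2, b2, deque([0]))
```

After this runs, `merged` and `muxed` are both 10, but `b` still holds the packet 20 while `b2` is empty.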

Additionally, the following functions manipulate only event packets:
²Note that this can be implemented by a MERGE with special properties on XPP™.



EmPf o zeito06/12/2002 14S54 



-0S-DEZ-2002 14:57 POT.-RNUI. P. PIETRUK +49 721 4G930S S.ll 






o 0-FILTER, 1-FILTER: A FILTER has an input E and an output U. A 0-FILTER copies a 0-event
from E to U, but 1-events at E are discarded. A 1-FILTER copies 1-events and discards 0-events.

o INVERTER: Copies all events from input E to output U but inverts its value.

o 0-CONSTANT, 1-CONSTANT: 0-CONSTANT copies all events from input E to output U, but
changes them all to value 0. 1-CONSTANT changes all to value 1.

o ECOMB: Combines two or more inputs E1, E2, E3..., producing a packet at output U. The output
is a 1-event if and only if one or more of the input packets are 1-events (logical or). A packet must
be available at all inputs before an output packet is produced.³

o ESEQ[seq]: An ESEQ generates a sequence "seq" of events, e.g. "0001", at its output U. If it
has an input START, one entire sequence is generated for each event packet arriving at START. The
sequence is only repeated if the next event arrives at START. However, if START is not connected,
ESEQ constantly repeats the sequence.

Note that ALU, MUX, DEMUX, GATE and ECOMB functions behave like their equivalents in classical
dataflow machines [3, 4].
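
The event functions can be modelled in the same spirit. This is a hedged sketch that treats each event stream as a Python list of 0/1 values. The function names are hypothetical, and the unconnected-START case of ESEQ (endless repetition) is deliberately left out.

```python
def event_filter(kind, events):
    """k-FILTER: pass only events whose value equals kind, discard all others."""
    return [e for e in events if e == kind]

def inverter(events):
    """INVERTER: copy every event but invert its value."""
    return [1 - e for e in events]

def constant(kind, events):
    """k-CONSTANT: copy every event but force its value to kind."""
    return [kind for _ in events]

def ecomb(*inputs):
    """ECOMB: one output event per complete set of input events (logical or).
    zip() stops at the shortest stream, modelling the wait for all inputs."""
    return [int(any(group)) for group in zip(*inputs)]

def eseq(seq, start_events):
    """ESEQ with connected START: emit the whole sequence once per START event."""
    out = []
    for _ in start_events:
        out.extend(int(c) for c in seq)
    return out

# A 1-FILTER followed by a 0-CONSTANT turns every 1-event into a 0-event
# and drops every 0-event.
restart = constant(0, event_filter(1, [0, 1, 0, 1]))
```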

3.2 Packet-based Communication Network

The communication network of an RDFP can connect an output of one object (i.e. its respective
function) to the input(s) of one or several other objects. This is usually achieved by busses and switches. By
placing the functions properly on the objects, many functions can be connected arbitrarily up to a limit
imposed by the device size. As mentioned above, all values are communicated as packets. A separate
communication network exists for data and event packets. The packets synchronize the functions as in a
dataflow machine with acknowledge [3]. I.e., the function only executes when all input packets are
available (apart from the non-strict exceptions as described above). The function also stalls if the last output
packet has not been consumed. Therefore a data-flow graph mapped to an RDFP self-synchronizes its
execution without the need for external control. Only if two or more function outputs (data or event) are
connected to the same function input ("N to 1 connection"), the self-synchronization is disabled.⁴ The
user has to ensure that only one packet arrives at a time in a correct CDFG. Otherwise a packet might
get lost, and the value resulting from combining two or more packets is undefined. However, a function
output can be connected to many function inputs ("1 to N connection") without problems.
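
The firing rule described above (execute only when all inputs are available, stall while the last output packet is unconsumed) can be sketched as follows. This is a simplified software model under stated assumptions: each input is an unbounded queue, the output is a single slot, and the class and method names are hypothetical.

```python
from collections import deque

class Function:
    """A strict dataflow function with one output slot."""

    def __init__(self, op, n_inputs):
        self.op = op
        self.inputs = [deque() for _ in range(n_inputs)]
        self.output = None          # None means the output slot is free

    def can_fire(self):
        # Fire only if the previous output was consumed and every input has a packet.
        return self.output is None and all(self.inputs)

    def step(self):
        if not self.can_fire():
            return False            # stall
        self.output = self.op(*(q.popleft() for q in self.inputs))
        return True

    def take(self):
        """The consumer removes the output packet, which acts as the acknowledge."""
        value, self.output = self.output, None
        return value

add = Function(lambda a, b: a + b, 2)
add.inputs[0].append(1)
stalled_on_input = not add.step()      # input B is still missing
add.inputs[1].append(2)
fired = add.step()                     # both inputs present: fires
add.inputs[0].append(5)
add.inputs[1].append(5)
stalled_on_output = not add.step()     # result 3 has not been consumed yet
first = add.take()
second = add.step() and add.take()
```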

There are some special cases:

o A function input can be preloaded with a distinct value during configuration. This packet is
consumed like a normal packet coming from another object.

o A function input can be defined as constant. In this case, the packet at the input is reproduced
repeatedly for each function execution.

³Note that this function is implemented by the EAND operator on the XPP™.

⁴Note that on XPP™, a "N to 1 connection" for events is realized by the EOR function, and for data by just assigning
An RDFP requires register delays in its dataflow. Otherwise very long combinational delays and
asynchronous feedback is possible. We assume that delays are inserted at the inputs of some functions (like
for most ALUs) and in some routing segments of the communication network. Note that registers change
the timing, but not the functionality of a correct CDFG.

The following HLL features are not supported by the method described here:

o pointer operations

o library calls, operating system calls (including standard I/O functions)

o recursive function calls (Note that non-recursive function calls can be eliminated by function
inlining and therefore are not considered here.)

All scalar data types are converted to type integer. Integer values are equivalent to data packets in
the RDFP. Arrays (possibly multi-dimensional) are the only composite data types considered.

The following additional features are supported:

INPORTs and OUTPORTs can be accessed by the HLL functions getstream(name, value) and
putstream(name, value) respectively.

4.2 Mapping of High-Level Language Constructs

This method converts a HLL program to a CDFG consisting of the RDFP functions defined in Section 3.1.
Before the processing starts, all HLL program arrays are mapped to RDFP RAM functions. An array a
is mapped to RAM RAM(a). If several arrays are mapped to the same RAM, an offset is assigned, too.
The RAMs are added to an initially empty CDFG. There must be enough RAMs of sufficient size for all
program arrays.
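
The text does not prescribe how arrays are distributed over RAMs. As one possible illustration, the sketch below uses a simple first-fit strategy, which is an assumption and not the method's prescribed allocation. The function name, the uniform RAM size and the array sizes are hypothetical.

```python
def map_arrays(arrays, ram_size, ram_count):
    """First-fit mapping. arrays maps array name -> size in words.
    Returns array name -> (RAM index, offset within that RAM)."""
    used = [0] * ram_count
    mapping = {}
    for name, words in arrays.items():
        for ram in range(ram_count):
            if used[ram] + words <= ram_size:
                mapping[name] = (ram, used[ram])
                used[ram] += words
                break
        else:
            # There must be enough RAMs of sufficient size for all arrays.
            raise ValueError(f"no RAM with room for array {name!r}")
    return mapping

mapping = map_arrays({"a": 100, "b": 100, "c": 200}, ram_size=256, ram_count=2)
```

Here `a` and `b` share RAM 0 (with `b` at offset 100), while `c` is placed alone in RAM 1.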

The CDFG is generated by a traversal of the AST of the HLL program. It processes the program
statement by statement and descends into the loops and conditional statements as appropriate. The following
two pieces of information are updated at every program point⁵ during the traversal:

o START points to an event output of a RDFP function. This output delivers a 0-event whenever
program execution reaches this program point. At the beginning, a 0-CONSTANT preloaded
with an event input is added to the CDFG. (It delivers a 0-event immediately after configuration.)
START initially points to its output. This event is used to start the overall program execution. The
STARTnew signal generated after a program part has finished executing is used as new START
signal for the following program parts, or it signals termination of the entire program. The START

⁵In a program, program points are between two statements or before the beginning or after the end of a program component like
a loop or a conditional statement.




events guarantee that the execution order of the original program is maintained wherever the data
dependencies alone are not sufficient. This scheduling scheme is similar to a one-hot controller
for digital hardware.

o VARLIST is a list of {variable, function-output} pairs. The pairs map integer variables or array
elements to a CDFG function's output. The first pair for a variable in VARLIST contains the
output of the function which produces the value of this variable valid at the current program point.
New pairs are always added to the front of VARLIST. The expression VARDEF(var) refers to the
function-output of the first pair with variable var in VARLIST.⁶

The following subsections systematically list all HLL program components and describe how they are
processed, thereby altering the CDFG, START and VARLIST.
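
VARLIST and VARDEF can be pictured as a list with front insertion and a first-match lookup. The following is a minimal sketch under that reading. The Python representation (tuples in a list) and the output name "ALU1.X" are illustrative assumptions.

```python
def define(varlist, var, output):
    """Add a new {variable, function-output} pair to the front of VARLIST."""
    return [(var, output)] + varlist

def vardef(varlist, var):
    """VARDEF(var): function-output of the first pair for var in VARLIST."""
    for name, output in varlist:
        if name == var:
            return output
    raise KeyError(var)

varlist = define([], "i", 7)
varlist = define(varlist, "a", 3)
varlist = define(varlist, "a", "ALU1.X")   # a is redefined by some ALU output
```

Older pairs stay in the list but are shadowed: `vardef(varlist, "a")` now yields `"ALU1.X"`, while the pair `("a", 3)` remains behind it.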



4.2.1 Integer Expressions and Assignments

Straight-line code without array accesses can be directly mapped to a data-flow graph. One ALU is
allocated for each operator in the program. Because of the self-synchronization of the ALUs, no explicit
control or scheduling is needed. Therefore processing these assignments does not access or alter START.
The data dependences (as they would be exposed in the DAG representation of the program [1]) are
analyzed through the processing of VARLIST. These assignments synchronize themselves through the
data-flow. The data-driven execution automatically exploits the available instruction level parallelism.

All assignments evaluate the right-hand side (RHS) or source expression. This evaluation results in a
pointer to a CDFG object's output (or pseudo-object as defined below). For integer assignments, the
left-hand side (LHS) variable or destination is combined with the RHS result object to form a new pair
{LHS, result(RHS)} which is added to the front of VARLIST.

The simplest statement is a constant assigned to an integer:⁷
a = 5;

It doesn't change the CDFG, but adds {a, 5} to the front of VARLIST. The constant 5 is a "pseudo-
object" which only holds the value, but does not refer to a CDFG object. Now VARDEF(a) equals 5 at
subsequent program points before a is redefined.

Integer assignments can also combine variables already defined and constants:
b = a * 2 + 3;

In the AST, the RHS is already converted to an expression tree. This tree is transformed to a combination
of old and new CDFG objects (which are added to the CDFG) as follows: Each operator (internal node)
of the tree is substituted by an ALU with the opcode corresponding to the operator in the tree. If a leaf
node is a constant, the ALU's input is directly connected to that constant. If a leaf node is an integer
variable var, it is looked up in VARLIST, i.e. VARDEF(var) is retrieved. Then VARDEF(var) (an output
of an already existing object in CDFG or a constant) is connected to the ALU's input. The output of the
ALU corresponding to the root operator in the expression tree is defined as the result of the RHS. Finally,
a new pair {LHS, result(RHS)} is added to VARLIST. If the two assignments above are processed, the

⁶This method of using a VARLIST is adapted from the Transmogrifier C compiler [5].
⁷Note that we use C syntax for the following examples.
CDFG with two ALUs in Fig. 2 is created.⁸ Outputs occurring in VARLIST are labeled by Roman
numbers. After these two assignments, VARLIST = [{b, I}, {a, 5}]. (The front of the list is on the left
side.) Note that all inputs connected to a constant (whether direct from the expression tree or retrieved
from VARLIST) must be defined as constant. Inputs defined as constants have a small c next to the input
arrow in Fig. 2.
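
The expression-tree translation can be sketched as a short recursive function. This is an illustrative model only: the tuple encoding of the tree, the ALU naming scheme ("ALU1", "ALU2") and the list-based CDFG are assumptions made for the example, and the Roman-number label I of the text corresponds to "ALU2.X" here.

```python
def compile_expr(node, varlist, cdfg):
    """Return the output (or constant) carrying the value of an expression tree.
    node is an int constant, a variable name, or a tuple (op, left, right)."""
    if isinstance(node, int):
        return node                                   # constant pseudo-object
    if isinstance(node, str):
        # VARDEF(var): first matching pair in VARLIST
        return next(out for name, out in varlist if name == node)
    op, left, right = node
    a = compile_expr(left, varlist, cdfg)
    b = compile_expr(right, varlist, cdfg)
    alu = f"ALU{len(cdfg) + 1}"                       # one ALU per operator
    cdfg.append((alu, op, a, b))
    return alu + ".X"

def assign(lhs, rhs, varlist, cdfg):
    """Process 'lhs = rhs;' and return the updated VARLIST."""
    return [(lhs, compile_expr(rhs, varlist, cdfg))] + varlist

cdfg = []
varlist = assign("a", 5, [], cdfg)                              # a = 5;
varlist = assign("b", ("+", ("*", "a", 2), 3), varlist, cdfg)   # b = a * 2 + 3;
```

The first assignment leaves the CDFG empty. The second one adds a multiplier with constant inputs 5 and 2 and an adder fed by the multiplier's output and the constant 3.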

4.2.2 Conditional Integer Assignments

For conditional if-then-else statements containing only integer assignments, objects for condition
evaluation are created first. The object event output indicating the condition result is kept for choosing
the correct branch result later. Next, both branches are processed in parallel, using separate copies
VARLIST1 and VARLIST2 of VARLIST. (VARLIST itself is not changed.) Finally, for all variables
added to VARLIST1 or VARLIST2, a new entry for VARLIST is created (combination phase). The valid
definitions from VARLIST1 and VARLIST2 are combined with a MUX function, and the correct input
is selected by the condition result. For variables only defined in one of the two branches, the multiplexer
uses the result retrieved from the original VARLIST for the other branch. If the original VARLIST does
not have an entry for this variable, a special "undefined" constant value is used. However, in a
functionally correct program this value will never be used. As an optimization, only variables live after the
if-then-else structure need to be added to VARLIST in the combination phase.⁹

Consider the following example: 
i = 7;
a = 3;
if (i < 10) {
  a = 5;
  c = 7;
}
else {
  c = a - 1;
  d = 0;
}

Fig. 3 shows the resulting CDFG. Before the if-then-else construct, VARLIST = [{a, 3}, {i, 7}]. After
processing the branches, for the then branch, VARLIST1 = [{c, 7}, {a, 5}, {a, 3}, {i, 7}], and for the
else branch, VARLIST2 = [{d, 0}, {c, I}, {a, 3}, {i, 7}]. After combination, VARLIST = [{d, II}, {c,
III}, {a, IV}, {a, 3}, {i, 7}].

Note that case- or switch-statements can be processed, too. since they can - without loss of generality - 
be converted to nested if-then-else statements. 

Processing conditional statements this way does not require explicit control and does not change START.
Both branches are executed in parallel and synchronized by the data-flow. It is possible to pipeline the
dataflow for optimal throughput.
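
The combination phase can be illustrated with the VARLISTs of the example above. The sketch below is an assumption-laden model, not the compiler's actual data structure: the branch lists are taken to be the original VARLIST with new pairs prepended, the then-branch value is wired to MUX input B because a true comparison yields a 1-event, and the names "CMP1.U", "ALU1.X" and "MUXn" are hypothetical labels.

```python
def lookup(varlist, var):
    """First definition of var, or the special 'undefined' constant."""
    for name, out in varlist:
        if name == var:
            return out
    return "undefined"

def combine(varlist, varlist1, varlist2, cond, cdfg):
    """Create one MUX per variable added in either branch; return the new VARLIST."""
    new1 = [n for n, _ in varlist1[:len(varlist1) - len(varlist)]]
    new2 = [n for n, _ in varlist2[:len(varlist2) - len(varlist)]]
    result = list(varlist)
    for name in dict.fromkeys(new1 + new2):            # unique, in order of appearance
        mux = f"MUX{len(cdfg) + 1}"
        # (name, input A = else value, input B = then value, SEL)
        cdfg.append((mux, lookup(varlist2, name), lookup(varlist1, name), cond))
        result.insert(0, (name, mux + ".X"))
    return result

varlist = [("a", 3), ("i", 7)]
varlist1 = [("c", 7), ("a", 5)] + varlist               # then branch
varlist2 = [("d", 0), ("c", "ALU1.X")] + varlist        # else branch
cdfg = []
combined = combine(varlist, varlist1, varlist2, "CMP1.U", cdfg)
```

The order of the new pairs at the front of the list differs from the one printed in the text, which is harmless because each variable's first pair is what counts.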

⁸Note that the input and output names can be deduced from their position, cf. Fig. 1. Also note that the compiler
frontend would normally have substituted the second assignment by b = 13 (constant propagation). For the simplicity of this
explanation, no frontend optimizations are considered in this and the following examples.

⁹Definition: A variable is live at a program point if its value is read at a statement reachable from here without intermediate
redefinition.
4.2.3 General Conditional Statements

Conditional statements containing either array accesses (cf. Section 4.2.7 below) or inner loops cannot
be processed as described in Section 4.2.2. Data packets must only be sent to the active branch. This is
achieved by the implementation shown in Fig. 8, similar to the method presented in [4].

A dataflow analysis is performed to compute used sets use and defined sets def [1] of both branches.¹⁰
For the current VARLIST entries of all variables in IN = use(thenbody) ∪ def(thenbody) ∪
use(elsebody) ∪ def(elsebody) ∪ use(header), DEMUX functions controlled by the IF condition are
inserted. Note that arrows with double lines in Fig. 8 denote connections for all variables in IN, and the
shaded DEMUX function stands for several DEMUX functions, one for each variable in IN. The
DEMUX functions forward data packets only to the selected branch. New lists VARLIST1 and VARLIST2
are compiled with the respective outputs of these DEMUX functions. The then-branch is processed with
VARLIST1, and the else-branch with VARLIST2. Finally, the output values are combined. OUT
contains the new values for the same variables as in IN. Since only one branch is ever activated, there will
not be a conflict due to two packets arriving simultaneously. The combinations will be added to VARLIST
after the conditional statement. If the IF execution shall be pipelined, MERGE opcodes for the outputs
must be inserted, too. They are controlled by the condition like the DEMUX functions.
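
The construction of IN and of the two branch lists can be sketched as follows. This is a simplified illustration: the use/def sets are given by hand rather than computed, variables without a current VARLIST entry are skipped, and routing the then-branch through DEMUX output Y (selected by a 1-event) is an assumption derived from the DEMUX definition. All names are hypothetical.

```python
def split_for_branches(varlist, in_vars, cond, cdfg):
    """Insert one DEMUX per variable in IN that has a VARLIST entry.
    Returns (VARLIST1, VARLIST2) for the then- and else-branch."""
    v1, v2 = list(varlist), list(varlist)
    for var in sorted(in_vars):
        src = next((out for name, out in varlist if name == var), None)
        if src is None:
            continue                               # no current entry, nothing to route
        demux = f"DEMUX_{var}"
        cdfg.append((demux, src, cond))
        v1.insert(0, (var, demux + ".Y"))          # SEL = 1: then-branch
        v2.insert(0, (var, demux + ".X"))          # SEL = 0: else-branch
    return v1, v2

use_then, def_then = set(), {"a", "c"}
use_else, def_else = {"a"}, {"c", "d"}
use_header = {"i"}
in_vars = use_then | def_then | use_else | def_else | use_header

varlist = [("a", "ALU2.X"), ("i", "ALU1.X")]
cdfg = []
varlist1, varlist2 = split_for_branches(varlist, in_vars, "SEL", cdfg)
```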

The following extension with respect to [4] is added (dotted lines in Fig. 8) in order to control the
execution as mentioned above with START events: The START input is ECOMB-combined with the condition
output and connected to the SEL input of the DEMUX functions. The START inputs of thenbody and
elsebody are generated from the ECOMB output sent through a 1-FILTER and a 0-CONSTANT¹¹ or
through a 0-FILTER, respectively. The overall STARTnew output is generated by a simple "2 to 1
connection" of thenbody's and elsebody's STARTnew outputs. With this extension, arbitrarily nested
conditional statements or loops can be handled within thenbody and elsebody.

4.2.4 WHILE Loops

WHILE loops are processed similarly to the scheme presented in [4], cf. Fig. 9. As in Section 4.2.3,
double line connections and shaded MERGE and DEMUX functions represent duplication for all variables
in IN. Here IN = use(whilebody) ∪ def(whilebody) ∪ use(header). The WHILE loop executes as
follows: In the first loop iteration, the MERGE functions select all input values from VARLIST at loop
entry (SEL=0). The MERGE outputs are connected to the header and the DEMUX functions. If the
while condition is true (SEL=1), the input values are forwarded to the whilebody, otherwise to OUT.
The output values of the while body are fed back to whilebody's input via the MERGE and DEMUX
operators as long as the condition is true. Finally, after the last iteration, they are forwarded to OUT. The
outputs are added to the new VARLIST.¹²

Two extensions with respect to [4] are added (dotted lines in Fig. 9):

¹⁰A variable is used in a statement (and hence in a program region containing this statement) if its value is read. A variable
is defined in a statement (or region) if a value is assigned to it.
¹¹The 0-CONSTANT is required since START events must always be 0-events.

¹²Note that the MERGE function for variables not live at the loop's beginning and the whilebody's beginning can be removed.
For these variables, only the DEMUX function (to output the final value) is required. Also note that the
MERGE functions can be replaced by simple "2 to 1 connections" if the configuration process guarantees that packets from
IN1 always arrive at the DEMUX's input before feedback values arrive.
o In [4], the SEL input of the MERGE functions is preloaded with 0. Hence the loop execution
begins immediately and can be executed only once. Instead, we connect the START input to the
MERGE's SEL input ("2 to 1 connection" with the header output). This allows to control the time
of the start of the loop execution and to restart it.

o The whilebody's START input is connected to the header output, sent through a
1-FILTER/0-CONSTANT combination as above (generates a 0-event for each loop iteration). By
ECOMB-combining whilebody's STARTnew output with the header output for the MERGE functions'
SEL inputs, the next loop iteration is only started after the previous one has finished. The while
loop's STARTnew output is generated by filtering the header output for a 0-event.

With these extensions, arbitrarily nested conditional statements or loops can be handled within
whilebody.
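
The circulation of one variable through MERGE, header and DEMUX can be traced with a sequential software model. This is only a sketch of the data path for a single variable. It does not model packet timing or the START extensions, and the function name is hypothetical.

```python
def run_while(entry_value, cond, body):
    """Trace one variable through the MERGE / header / DEMUX structure.
    Returns the value forwarded to OUT and the header's event outputs."""
    sel = 0                       # first iteration: MERGE takes the loop-entry value
    feedback = None
    header_events = []
    while True:
        x = feedback if sel else entry_value      # MERGE
        c = int(cond(x))                          # header evaluates the condition
        header_events.append(c)
        if c:                                     # DEMUX, SEL=1: forward to whilebody
            feedback = body(x)
            sel = 1                               # next MERGE takes the fed-back value
        else:                                     # DEMUX, SEL=0: forward to OUT
            return x, header_events

# while (x < 3) x = x + 1;  with x = 0 at loop entry
out, header_events = run_while(0, lambda x: x < 3, lambda x: x + 1)
start_new = [e for e in header_events if e == 0]  # 0-FILTER on the header output
```

The header emits one 1-event per executed iteration and a single 0-event at loop exit, which is why filtering for a 0-event yields exactly one STARTnew event.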

FOR loops are particularly regular WHILE loops. Therefore we could handle them as explained above.
However, our RDFP features the special counter function CNT and the data packet multiplication
function MDATA which can be used for a more efficient implementation of FOR loops. This new FOR loop
scheme is shown in Fig. 10.

A FOR loop is controlled by a counter CNT. The lower bound (LB), upper bound (UB), and increment
(INC) expressions are evaluated like any other expressions (see Sections 4.2.1 and 4.2.7) and connected
to the respective inputs.

As opposed to WHILE loops, a MERGE/DEMUX combination is only required for variables in IN1 =
def(forbody), i.e. those defined in forbody.¹³ IN1 does not contain variables which are only used
in forbody, LB, UB, or INC, and does also not contain the loop index variable. Variables in IN1 are
processed as in WHILE loops, but the MERGE and DEMUX functions' SEL input is connected to
CNT's W output. (The W output does the inverse of a WHILE loop's header output; it outputs a
1-event after the counter has terminated. Therefore the inputs of the MERGE functions and the outputs
of the DEMUX functions are swapped here, and the MERGE functions' SEL inputs are preloaded with
1-events.)

CNT's X output provides the current value of the loop index variable. If the final index value is required
(live) after the FOR loop, it is selected with a DEMUX function controlled by CNT's U event output
(which produces one event for every loop iteration).

Variables in IN2 = use(forbody) \ def(forbody), i.e. those defined outside the loop and only used
(but not redefined) inside the loop, are handled differently. Unless it is a constant value, the variable's
input value (from VARLIST) must be reproduced in each loop iteration since it is consumed in each
iteration. Otherwise the loop would stall from the second iteration onwards. The packets are reproduced
by MDATA functions, with the SEL inputs connected to CNT's U output. The SEL inputs must be
preloaded with a 1-event to select the first input. The 1-event provided by the last iteration selects a new
value for the next execution of the entire loop.
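
How MDATA reproduces a loop-invariant value under control of CNT's U output can be checked with a small model. This is an illustrative sketch: the event stream is built by hand from the preloaded 1-event followed by the U events of one loop execution, and the function name and the values 42 and 99 are hypothetical.

```python
from collections import deque

def mdata(a, sel_events):
    """MDATA: a 1-event consumes and outputs the next packet at A,
    a 0-event outputs a copy of the packet consumed last."""
    out, current = [], None
    for s in sel_events:
        if s == 1:
            current = a.popleft()
        out.append(current)
    return out

n = 3                                   # loop with three iterations
u = [0] * (n - 1) + [1]                 # CNT's U output for one loop execution
sel = [1] + u                           # preloaded 1-event, then the U events
a = deque([42, 99])                     # value for this and for the next loop execution
copies = mdata(a, sel)
```

The first three outputs supply the three iterations with the value 42. The final 1-event already fetches 99, the value for the first iteration of the next execution of the loop.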

¹³Note that the MERGE functions can be replaced by simple "2 to 1 connections" as for WHILE loops if the configuration
The following control events (dotted lines in Fig. 10) are similar to the WHILE loop extensions, but
simpler. CNT's START input is connected to the loop's overall START signal. STARTnew is generated
from CNT's W output, sent through a 1-FILTER and 0-CONSTANT. CNT's V output produces one
0-event for each loop iteration and is therefore used as forbody's START. Finally, CNT's NEXT input is
connected to forbody's STARTnew output.

For pipelined loops (as defined below in Section 4.2.6), loop iterations are allowed to overlap. Therefore
CNT's NEXT input need not be connected. Now the counter produces index variable values and control
events as fast as they can be consumed. However, in this case CNT's W output is not sufficient as overall
STARTnew output since the counter terminates before the last iteration's forbody finishes. Instead,
STARTnew is generated from CNT's U output ECOMB-combined with forbody's STARTnew output,
sent through a 1-FILTER/0-CONSTANT combination. The ECOMB produces an event after termination
of each loop iteration, but only the last event is a 1-event because only the last output of CNT's U output
is a 1-event. Hence this event indicates that the last iteration has finished. Cf. Section 4.3 for a FOR loop
example compilation with and without pipelining.
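
The event arithmetic behind the pipelined STARTnew signal can be verified with a few lines. This sketch models event streams as lists and assumes forbody's STARTnew output delivers one 0-event per finished iteration, as START events are always 0-events. The function name is hypothetical.

```python
def start_new_pipelined(n):
    """STARTnew event stream of a pipelined FOR loop with n iterations (n >= 1)."""
    u = [0] * (n - 1) + [1]                  # CNT's U output: last event is a 1-event
    body_done = [0] * n                      # forbody's STARTnew, one per iteration
    combined = [a | b for a, b in zip(u, body_done)]   # ECOMB (logical or)
    filtered = [e for e in combined if e == 1]         # 1-FILTER
    return [0 for _ in filtered]                       # 0-CONSTANT

events = start_new_pipelined(4)
```

Regardless of the iteration count, exactly one 0-event leaves the chain, and it can only be produced once the last iteration's forbody has delivered its event.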

As for WHILE loops, these methods allow to process arbitrarily nested loops and conditional statements.
The following advantages over WHILE loop implementations are achieved:

o One index variable value is generated by the CNT function each clock cycle. This is faster and
smaller than the WHILE loop implementation which allocates a MERGE/DEMUX/ADD loop and
a comparator for the counter functionality.

o Variables in IN2 (only used in forbody) are reproduced in the special MDATA functions and need
not go through a MERGE/DEMUX loop. This is again faster and smaller than the WHILE loop
implementation.

The method described so far generates CDFGs performing the HLL program's functionality on an RDFP.
However, the program execution is unduly sequentialized by the START signals. In some cases,
innermost loops can be vectorized. This means that loop iterations can overlap, leading to a pipelined dataflow
through the operators of the loop body. The Pipeline Vectorization technique [6] can be easily applied to
the compilation method presented here. As mentioned above, for FOR loops, the CNT's NEXT input is
removed so that CNT counts continuously, thereby overlapping the loop iterations.

All loops without array accesses can be pipelined since the dataflow automatically synchronizes
loop-carried dependences, i.e. dependences between a statement in one iteration and another statement in a
subsequent iteration. Loops with array accesses can be pipelined if the array (i.e. RAM) accesses do
not cause loop-carried dependences or can be transformed to such a form. In this case no RAM address
is written in one and read in a subsequent iteration. Therefore the read and write accesses to the same
RAM may overlap. This degree of freedom is exploited in the RAM access technique described below.
Especially for dual-ported RAM it leads to considerable performance improvements.

4.2.7 Array Accesses

In contrast to scalar variables, array accesses have to be controlled explicitly in order to maintain the
program's correct execution order. As opposed to normal dataflow machine models [3], a RDFP does
not have a single address space. Instead, the arrays are allocated to several RAMs. This leads to a
different approach to handling RAM accesses and opens up new opportunities for optimization.
To reduce the complexity of the compilation process, array accesses are processed in two phases. Phase
1 uses "pseudo-functions" for RAM read and write accesses. A RAM read function has a RD data input
(read address) and an OUT data output (read value), and a RAM write function has WR and IN data
inputs (write address and write value). Both functions are labeled with the array the access refers to, and
both have a START event input and a U event output. The events control the access order. In Phase 2 all
accesses to the same RAM are combined and substituted by a single RAM function as shown in Fig. 1.
This involves manipulating the data and event inputs and outputs such that the correct execution order is
maintained and the outputs are forwarded to the correct part of the CDFG.

Phase 1: Since arrays are allocated to several RAMs, only accesses to the same RAM have to be
synchronized. Accesses to different RAMs can occur concurrently or even out of order. In case of data
dependences, the accesses self-synchronize automatically. Within pipelined loops, not even read and
write accesses to the same RAM have to be synchronized. This is achieved by maintaining separate
START signals for every RAM or even separate START signals for RAM read and RAM write accesses
in pipelined loops. At the end of a basic block [1]¹⁴, all STARTnew outputs must be combined by a
ECOMB to provide a START signal for the next basic block which guarantees that all array accesses in
the previous basic block are completed. For pipelined loops, this condition can even be relaxed: Only
after the loop exit all accesses have to be completed. The individual loop iterations need not be
synchronized.

First the RAM addresses are computed. The compiler frontend's standard transformation for array
accesses can be used, and a CDFG function's output is generated which provides the address. If applicable,
the offset with respect to the RDFP RAM (as determined in the initial mapping phase) must be added.
This output is connected to the pseudo RAM read's RD input (for a read access) or to the pseudo RAM
write's WR input (for a write access). Additionally, the OUT output (read) or IN input (write) is
connected. The START input is connected to the variable's START signal, and the U output is used as
START_new for the next access.

To avoid redundant read accesses, RAM reads are also registered in VARLIST. Instead of an integer
variable, an array element is used as the first element of the pair. However, a change in a variable occurring
in an array index invalidates the information in VARLIST. It must then be removed from it.
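
The bookkeeping described above can be sketched in C as follows. This is only an illustration under
stated assumptions: the type and function names (Varlist, varlist_register_read, varlist_lookup,
varlist_invalidate_index), the integer output_id handle and the restriction to a single index variable
per array element are all hypothetical and not part of the method's definition.

```c
#include <assert.h>
#include <string.h>

#define VARLIST_MAX 16
#define NAME_LEN 8

/* One VARLIST pair: (array element, CDFG output that already carries its value).
   Assumption of this sketch: an array element is identified by the array name
   and one index variable (names up to NAME_LEN - 1 characters); output_id is a
   hypothetical handle for a CDFG output. */
typedef struct {
    char array[NAME_LEN];
    char index[NAME_LEN];
    int output_id;
} VarlistEntry;

typedef struct {
    VarlistEntry entries[VARLIST_MAX];
    int count;
} Varlist;

static void copy_name(char *dst, const char *src) {
    strncpy(dst, src, NAME_LEN - 1);
    dst[NAME_LEN - 1] = '\0';
}

void varlist_init(Varlist *v) { v->count = 0; }

/* Register a RAM read so that a later identical read can reuse its output. */
void varlist_register_read(Varlist *v, const char *array, const char *index,
                           int output_id) {
    assert(v->count < VARLIST_MAX);
    VarlistEntry *e = &v->entries[v->count++];
    copy_name(e->array, array);
    copy_name(e->index, index);
    e->output_id = output_id;
}

/* Return the registered output for array[index], or -1 if a new read is needed. */
int varlist_lookup(const Varlist *v, const char *array, const char *index) {
    for (int i = 0; i < v->count; i++) {
        const VarlistEntry *e = &v->entries[i];
        if (strcmp(e->array, array) == 0 && strcmp(e->index, index) == 0)
            return e->output_id;
    }
    return -1;
}

/* A variable changed: remove every entry whose index uses that variable. */
void varlist_invalidate_index(Varlist *v, const char *var) {
    int kept = 0;
    for (int i = 0; i < v->count; i++) {
        if (strcmp(v->entries[i].index, var) != 0)
            v->entries[kept++] = v->entries[i];
    }
    v->count = kept;
}
```

After a read of a[i] has been registered, a second read of a[i] finds the entry and needs no new RAM
access; once i is assigned, varlist_invalidate_index removes the entry and the next read of a[i] is
compiled as a fresh access.
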
The following example with two read accesses compiles to the intermediate CDFG shown in Fig. 12. The
START signals refer only to variable a. STOP1 is the event connection which synchronizes the accesses.
Inputs START(old), i and j should be substituted by the actual outputs resulting from the program before
the array reads.

x = a[i];
y = a[j];
z = x * y;

Fig. 13 shows the translation of the following write access:

a[i] = x;

Footnote: A basic block is a program part with a single entry and a single exit point, i. e. a piece of linear code.
Phase 2: We now merge the pseudo-functions of all accesses to the same RAM and substitute them by
a single RAM function. For all data inputs (RD for read access and WR and IN for write access), GATEs
are inserted between the input and the RAM function. Their E inputs are connected to the respective
START inputs of the original pseudo-functions. If a RAM is read and written at only one program point,
the U output of the read and write access is moved to the ERD or EWR output, respectively. For example,
the single access a[i] = x from Fig. 13 is transformed to the final CDFG shown in Fig. 5.

However, if several read or several write accesses (e. g. pseudo-functions from different program points)
to the same RAM occur, the ERD or EWR events are not specific anymore. But a START_new event of
the original pseudo-function should only be generated for the respective program point, i. e. for the
current access. This is achieved by connecting the START signals of all other accesses (pseudo-functions)
of the same type (read or write) with the inverted START signal of the current access. The resulting
signal produces an event for every access, but a 1-event only for the current access. This event is
ECOMB-combined with the RAM's ERD or EWR output. The ECOMB's output will only occur after
the access is completed. Because ECOMB OR-combines its event packets, only the current access
produces a 1-event. Now, this event is filtered with a 1-FILTER and changed by a 0-CONSTANT, resulting
in a START_new signal which produces a 0-event only after the current access is completed, as required.

For several accesses, several sources are connected to the RD, WR and IN inputs of a RAM. This disables
the self-synchronization. However, since only one access occurs at a time, the GATEs only allow one
data packet to arrive at the inputs.

For read accesses, the packets at the OUT output face the same problem as the ERD event packets:
They occur for every read access, but must only be used (and forwarded to subsequent operators) for
the current access. This can be achieved by connecting the OUT output via a DEMUX function. The Y
output of the DEMUX is used, and the X output is left unconnected. Then it acts as a selective gate which
only forwards packets if its SEL input receives a 1-event, and discards its data input if SEL receives a
0-event. The signal created by the ECOMB described above for the START_new signal creates a 1-event
for the current access, and a 0-event otherwise. Using it as the SEL input achieves exactly the desired
functionality.
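
The selective-gate behaviour of a DEMUX whose X output is left unconnected can be modelled with a
few lines of C. This is a behavioural sketch only; the Packet type and the function name
demux_selective_gate are assumptions made for illustration, and timing and handshaking of the real
RDFP functions are not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* A data packet; present == false models "no packet forwarded". */
typedef struct {
    bool present;
    int value;
} Packet;

/* DEMUX with only the Y output connected.
   sel_event is the event value (0 or 1) arriving at SEL.
   A 1-event routes the data packet to Y, so it is forwarded.
   A 0-event routes it to the unconnected X output, so it is discarded. */
Packet demux_selective_gate(int sel_event, Packet data) {
    assert(sel_event == 0 || sel_event == 1);
    Packet none = { false, 0 };
    return (sel_event == 1) ? data : none;
}
```

With the ECOMB-derived signal on SEL, a packet read for the current access passes through, while
packets produced by other read accesses to the same RAM are dropped.
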

Fig. 4 shows the resulting CDFG for the first example above (two read accesses), after applying the
transformations of Phase 2 to Fig. 12. STOP1 is now generated as follows: START(old) is inverted,
"2 to 1 connected" to STOP1 (because it is the START input of the second read pseudo-function),
ECOMB-combined with the RAM's ERD output and sent through the 1-FILTER/0-CONSTANT
combination. START(new) is generated similarly, but here START(old) is directly used and STOP1
inverted. The GATEs for inputs IN (i and j) are connected to START(old) and STOP1, respectively, and
the DEMUX functions for outputs x and y are connected to the ECOMB outputs related to STOP1 and
START(new).

Multiple write accesses use the same control events, but instead of one GATE per access for the RD
inputs, one GATE for WR and one GATE for IN (with the same E input) are used. The EWR output is
processed like the ERD output for read accesses.

This transformation ensures that all RAM accesses are executed correctly, but it is not very fast since read
or write accesses to the same RAM are not pipelined. The next access only starts after the previous one
is completed, even if the RAM being used has several pipeline stages. This inefficiency can be removed
as follows:

First, continuous sequences of either read accesses or write accesses (not mixed) within a basic block are
detected by checking for pseudo-functions whose U output is directly connected to the START input of
another pseudo-function of the same RAM and the same type (read or write). For these sequences, it is
possible to stream data into the RAM rather than waiting for the previous access to complete. For this
purpose, a combination of MERGE functions selects the RD or WR and IN inputs in the order given
by the sequence. The MERGEs must be controlled by iterative ESEQs guaranteeing that the inputs are
only forwarded in the desired order. Then only the first access in the sequence needs to be controlled by
a GATE or GATEs. Similarly, the OUT outputs of a read access can be distributed more efficiently for
a sequence. A combination of DEMUX functions with the same ESEQ control can be used. It is most
efficient to arrange the MERGE and DEMUX functions as balanced binary trees.
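
The benefit of a balanced tree can be made concrete with a small calculation. The following C sketch
assumes that every MERGE (or DEMUX) function has exactly two data inputs (or outputs); this fan-in
and the function names are assumptions for illustration. Under that assumption, a sequence of n
accesses needs n - 1 functions in either arrangement, but a packet passes through at most
ceil(log2(n)) stages in a balanced tree, compared with up to n - 1 stages in a linear chain.

```c
#include <assert.h>

/* Number of two-input MERGE (or two-output DEMUX) functions needed to
   combine (or distribute) n packet streams; the same for chain and tree. */
unsigned node_count(unsigned n) {
    assert(n >= 1);
    return n - 1;
}

/* Longest path, in stages, through a balanced binary tree with n leaves:
   the smallest d with 2^d >= n. */
unsigned balanced_tree_depth(unsigned n) {
    assert(n >= 1);
    unsigned depth = 0;
    unsigned long long capacity = 1; /* leaves reachable with 'depth' stages */
    while (capacity < n) {
        capacity *= 2;
        depth++;
    }
    return depth;
}

/* Longest path, in stages, if the functions are arranged as a linear chain. */
unsigned chain_depth(unsigned n) {
    assert(n >= 1);
    return n - 1;
}
```

For a sequence of eight read accesses, for instance, the balanced arrangement has three stages where
a chain would have seven.
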

The START_new signal is generated as follows: For a sequence of length n, the START signal of the
entire sequence is replicated n times by an ESEQ[00..1] function with the START input connected to
the sequence's START. Its output is directly "N to 1 connected" with the other accesses' START signals
(for single accesses) or ESEQ outputs sent through 0-CONSTANT (for access sequences), ECOMB-
connected to EWR or ERD, respectively, and sent through a 1-FILTER/0-CONSTANT combination,
similar to the basic method described above. Since only the last ESEQ output is a 1-event, only the
last RAM access generates a START_new, as required. Alternatively, for read accesses, the last
generated output can be sent through a GATE (without the E input connected), thereby producing a
START_new event.
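
The effect of the ESEQ[00..1] pattern can be illustrated with a small behavioural model in C. It is a
simplification under stated assumptions: the function names are hypothetical, and the ECOMB
combination with ERD/EWR and with the events of other accesses is left out, so the model only shows
why exactly one START_new results per sequence.

```c
#include <assert.h>
#include <stdbool.h>

/* Value of the k-th event (k = 0 .. n-1) emitted by ESEQ[00..1] for a
   sequence of length n: 0-events for all accesses except the last one. */
int eseq_event(unsigned n, unsigned k) {
    assert(n >= 1 && k < n);
    return (k == n - 1) ? 1 : 0;
}

/* 1-FILTER followed by 0-CONSTANT: only a 1-event passes the filter and is
   then turned into the 0-event that serves as START_new. The return value
   says whether a START_new event is produced for the given input event. */
bool start_new_emitted(int event) {
    return event == 1;
}

/* Count the START_new events produced while a sequence of n accesses runs. */
unsigned count_start_new(unsigned n) {
    unsigned count = 0;
    for (unsigned k = 0; k < n; k++) {
        if (start_new_emitted(eseq_event(n, k)))
            count++;
    }
    return count;
}
```

Whatever the sequence length, count_start_new returns 1, matching the requirement that only the last
RAM access of a sequence releases the next access.
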

Fig. 14 shows the optimized version of the first example (Figures 12 and 4) using the ESEQ method for
generating START_new, and Fig. 6 shows the final CDFG of the following, larger example with three
array reads. Here the latter method for producing the START_new event is used.

g = a[k];

If several read sequences, or read sequences and single read accesses, occur for the same RAM, 1-events
for detecting the current accesses must be generated for sequences of read accesses. They are needed
to separate the OUT values relating to separate sequences. The ESEQ output just defined, sent through
a 1-CONSTANT, achieves this. It is again "N to 1 connected" to the other accesses' START signals
(for single accesses) or ESEQ outputs sent through 0-CONSTANT (for access sequences). The resulting
event is used to control a first-stage DEMUX which is inserted to select the relevant OUT output data
packets of the sequence as described above for the basic method. Refer to the second example (Figures
15 and 16) in Section 4.3 for a complete example.

Input and output ports are processed similarly to vector accesses. A read from an input port is like an
array read without an address. The input data packet is sent to DEMUX functions which send it to the
correct subsequent operators. The STOP signal is generated in the same way as described above for
RAM accesses by combining the INPORT's U output with the current and other START signals.

Output ports control the data packets by GATEs like array write accesses. The STOP signal is also
created as for RAM accesses.

4.3 More Examples

Fig. 7 shows the generated CDFG for the following for loop:

a = b + c;
for (i=0; i<=10; i++) {
    a = a + i;
    x[i] = k;
}

In this example, IN1 = {a} and IN2 = {x} (cf. Fig. 10). The MERGE function for variable a is
replaced by a 2:1 data connection as mentioned in the footnote of Section 4.2. Note that only one
data packet arrives for variables b, c and k, and one final packet is produced for a (out). forbody does
not use a START event since both operations (the adder and the RAM write) are dataflow-controlled
by the counter anyway. But the RAM's EWR output is the forbody's START_new and is connected to
CNT's NEXT input. Note that the pipelining optimization, cf. Section 4.2.5, was not applied here. If it
is applied (which is possible for this loop), CNT's NEXT input is not connected, cf. Fig. 11. Here, the
loop iterations overlap. START_new is generated from CNT's U output and forbody's START_new (i. e.
RAM's EWR output), as defined at the end of Section 4.2.5.

The following program contains a vectorizable (pipelined) loop with one write access to array (RAM) x
and a sequence of two read accesses to array (RAM) y. After the loop, another single read access to y
occurs.

z = 0;
for (i=0; i<=10; i++) {
    x[i] = i;
    z = z + y[i] + y[2*i];
}
a = y[k];

Fig. 15 shows the intermediate CDFG generated before the array access Phase 2 transformation is
applied. The pipelined loop is controlled as follows: Within the loop, separate START signals for write
accesses to x and read accesses to y are used. The reentry to the forbody is also controlled by two
independent signals ("cycle1" and "cycle2"). For the read accesses, "cycle2" guarantees that the read y
accesses occur in the correct order. But the beginning of an iteration for read y and write x accesses is
not synchronized. Only at loop exit must all accesses be finished, which is guaranteed by signal "loop
finished". The single read access is completely independent of the loop.

Fig. 16 shows the final CDFG after Phase 2. Note that "cycle1" is removed since a single write access
needs no additional control, and "cycle2" is removed since the inserted MERGE and DEMUX functions
automatically guarantee the correct execution order. The read y accesses are not independent anymore
since they all refer to the same RAM, and the functions have been merged. ESEQs have been allocated
to control the MERGE and DEMUX functions of the read sequence, and for the first-stage DEMUX
functions which separate the OUT values for the read sequence and for the final single read access.
The ECOMBs, 1-FILTERs, 0-CONSTANTs and 1-CONSTANTs are allocated as described in Section
4.2.7, Phase 2, to generate correct control events for the GATEs and DEMUX functions.